This dataset was obtained from Kaggle here and includes the lat/lon coordinates of meteorite impact sites.
There 45,000 impact sites and in this analysis we will be analyzing the ones inside of the continential United States.
First we need to check for data integrity. If we count the NA’s we can see that they are most frequent in the lat/long, year, and then mass.
colSums(is.na(meteorites))
## name id nametype recclass mass fall
## 0 0 0 0 131 0
## year reclat reclong GeoLocation
## 288 7315 7315 0
We can confirm this visually by passing our dataset to a missing values map from the ‘Amelia’ package.
missmap(meteorites,
main = "Missingness Map of Meteorite Impact Dataset",
legend = T,
y.labels = NULL,
y.at = NULL,
col = c("#ff6961", "#84d9ff"))
A missingmap is a quick and easy way for us to gauge the missingness of our dataset however since it is visually based we can miss some details on a first glance, such as the missingniss along year and mass columns. If we run this code below we will get a raw count of the NA values per column.
sapply(meteorites, function(y) sum(length(which(is.na(y))))) %>%
data.frame() %>%
dplyr::rename(NA_COUNT = 1) %>%
dplyr::arrange(desc(NA_COUNT))
## NA_COUNT
## reclat 7315
## reclong 7315
## year 288
## mass 131
## name 0
## id 0
## nametype 0
## recclass 0
## fall 0
## GeoLocation 0
We can calculate the percentage of missing values by dividing the sum of NA’s by the product of rows multiplied by columns for all data points, which returns us 3.2%.
\[NA= \frac{\Sigma(NA)}{\Pi(Rows * Column)}\]
sum(is.na(meteorites))/prod(dim(meteorites))
## [1] 0.03291845
This number is low enough where we can simply filter these data-points out of our dataset and proceed without the need over imputation. We remove the missing values from our dataset and run a missing values map to confirm this as shown below.
# Filtering out missing values
meteorites_clean <- meteorites %>%
dplyr::filter(!is.na(year) & !is.na(mass) & !is.na(reclat) & !is.na(reclong))
missmap(meteorites_clean,
main = "Missingness Map of Meteorite Impact Dataset",
legend = T,
y.labels = NULL,
y.at = NULL,
col = c("#ff6961", "#84d9ff"))
If we run summary() We can see below two (2) main issues:
summary(meteorites_clean)
## name id nametype recclass
## Length:38116 Min. : 1 Length:38116 Length:38116
## Class :character 1st Qu.:10832 Class :character Class :character
## Mode :character Median :21733 Mode :character Mode :character
## Mean :25343
## 3rd Qu.:39887
## Max. :57458
## mass fall year reclat
## Min. : 0 Length:38116 Min. : 601 Min. :-87.37
## 1st Qu.: 7 Class :character 1st Qu.:1986 1st Qu.:-76.72
## Median : 29 Mode :character Median :1996 Median :-71.50
## Mean : 15600 Mean :1990 Mean :-39.59
## 3rd Qu.: 187 3rd Qu.:2002 3rd Qu.: 0.00
## Max. :60000000 Max. :2101 Max. : 81.17
## reclong GeoLocation
## Min. :-165.43 Length:38116
## 1st Qu.: 0.00 Class :character
## Median : 35.67 Mode :character
## Mean : 61.31
## 3rd Qu.: 157.17
## Max. : 178.20
First, We can see some meteorites have a mass of 0 grams, this could be because the actual mass was so small it did not register past a decimal point. To correct this we can filter these objects out.
meteorites_clean <- meteorites_clean %>%
dplyr::filter(mass != 0)
We will use 2016 as our ‘ceiling’ or the max point we want to find out. We can remove these rows and then make a density plot to see where our ‘floor’ should, that is to mean the lowest values we want filtered out.
meteorites_clean <- meteorites_clean %>%
dplyr::filter(year <= 2016)
density <- density(meteorites_clean$year)
plot_ly(x = ~density$x, y = ~density$y, type = 'scatter',
mode = 'lines', fill = 'tozeroy',
height = 500, width = 910,
fillcolor = "#DB4437",
line = list(color = 'white')) %>%
layout(title = "Meteorite Impact Frequency over Time (600 - 2016)",
hovermode = "x-unified",
titlefont = list(family = "Agency FB",
size = 25,
color = '#ffffff'),
font = list(family = "Agency FB", size = 15),
margin = 10,
paper_bgcolor='black',
plot_bgcolor='black',
xaxis = list(title = "Year",
color = '#ffffff'),
yaxis = list(title = "Density of Impacts",
color = '#ffffff'))
## Warning: The titlefont attribute is deprecated. Use title = list(font = ...)
## instead.
We can see the number of meteorite impacts increases over time as both our understanding and technology improves, however we see a significant jump approx after 1960’s. With this in mind we’ll filter out for values before 1960, this choice is arbitrary admittedly but will give us a clearer picture of ‘recent’ meteorite impact sites in our research area.
meteorites_clean <- meteorites_clean %>%
dplyr::filter(year >= 1960)
Here is what our new distribution looks like as filtered from 1960 - 2016 (76 yrs). We can see three (3) definite peaks in our density distribution which all occur in the late years of the decades they are in (late 70’s,late 80’s, and late 90’s)
density <- density(meteorites_clean$year)
plot_ly(x = ~density$x, y = ~density$y, type = 'scatter',
mode = 'lines', fill = 'tozeroy',
height = 500, width = 910,
fillcolor = "#4285F4",
line = list(color = 'white')) %>%
layout(title = "Meteorite Impact Frequency over Time (1960 - 2016)",
hovermode = "x-unified",
titlefont = list(family = "Agency FB",
size = 25,
color = '#ffffff'),
font = list(family = "Agency FB", size = 15),
margin = 10,
paper_bgcolor='black',
plot_bgcolor='black',
xaxis = list(title = "Year",
color = '#ffffff'),
yaxis = list(title = "Density of Impacts",
color = '#ffffff'))
## Warning: The titlefont attribute is deprecated. Use title = list(font = ...)
## instead.
Next we need to filter out for this impacting the continental united states. To achieve this we’ll need to select the points for the north/south and east/west boundaries of the continental
# hacky method using the northern/southern, and eastern/western most coords to create a box to filter on
meteorites_usa <- meteorites_clean %>%
dplyr::filter(reclat <= 49.3457868 & reclat >= 24.7433195) %>% #filter for north and south
dplyr::filter(reclong >= -124.7844079 & reclong <= -66.9513812) # filter for east and west
With our contemporary US state boundary file we will need to convert our dataframe to an sf object to perform the operation.
Now with out dataset properly cleaned and prepped we can move onto the Point-Pattern-Analysis (PPP) step.
First we’ll map out all 971 points on a map to see how the data appears.
Here we have mapped out the impact sites, color coding and scaling the size of each point by the mass (g).
pal <- colorNumeric(palette = "RdBu", domain = meteorites_usa$mass, reverse = T)
leaflet(data = meteorites_usa, height=500, width=910) %>%
addFullscreenControl(position = "topright") %>%
addResetMapButton() %>%
addProviderTiles(providers$Stamen.TonerLite)%>%
addCircleMarkers(lng = ~reclong, lat = ~reclat,
color = ~pal(mass),
radius = meteorites_usa$mass/1e5,
stroke = FALSE, fillOpacity = 1) %>%
addLegend("bottomright", pal = pal, values = ~mass,
title = "Mass (grams) <hr>",
labFormat = labelFormat(suffix = " g"),
opacity = 1)
We can see some clustering along the border between New Mexico & Texas but mostly it’s sporadic and random everywhere else. We also see an impact in california that is massive relatively speakin, clocking in at a mass of 2,753,000 grams
Below we have boxplotted the points to gauge the frequency. We can see that the mass clusters below 50K grams, with the median being ~300g.
plot_ly(meteorites_usa,
height = 500, width = 910) %>%
add_boxplot(y = ~mass, jitter = 0.5, pointpos = -1.8, boxpoints = 'all',
marker = list(color = '#36C5F0'),
line = list(color = '#FFFC00'),
name = "mass(g)") %>%
layout(yaxis = list(title = 'mass(g)')) %>%
layout(plot_bgcolor='black') %>%
layout(paper_bgcolor='black')
Here we have produced a heatmap to more clearly illustrate clusters for impact sites. We can see clearly that our original site is indeed a cluster but also that along California/Arizona is also one for consideration.
leaflet(data = meteorites_usa, height=500, width=910) %>%
addFullscreenControl(position = "topright") %>%
addResetMapButton() %>%
addProviderTiles(providers$Stamen.TonerLite) %>%
addHeatmap(lng = ~reclong, lat = ~reclat,
gradient = "magma",
minOpacity = 0.1,
cellSize = 1,
radius = 3, blur = 15)
In our dataset meteorites are coded via whether they were spotted on the ground after impact (“found”) or while in the mid-fall ("fell)
Overwhelmingly we can see most meteorites were spotted while on the ground after-impact was made, we have modified the size argument to help make the quantity more acute on our map.
pal <- colorFactor(c("#FF1C51", "#006AFF"), domain = c("Fell", "Found"))
labels <- c("Mid-Air", "Ground")
leaflet(data = meteorites_usa, height=500, width=910) %>%
addFullscreenControl(position = "topright") %>%
addResetMapButton() %>%
addProviderTiles(providers$Stamen.TonerLite) %>%
addCircleMarkers(lng = ~reclong, lat = ~reclat,
color = ~pal(fall),
radius = ~ifelse(fall == "Found", 2, 4),
stroke = FALSE, fillOpacity = 1) %>%
addLegend("bottomright", pal = pal, values = ~fall,
title = "<u>Found Status</u>",
opacity = 1,
labFormat = function(type, cuts, p){
paste0(labels)})
Here we can interestingly that the more recent impact sites are now impacting California/Arizona and that the datapoints along New Mexico /Arizona are mostly ‘older’ impact sites.
pal <- colorBin(
palette = "RdBu",bins = 5,
domain = meteorites_usa$year, reverse = T)
leaflet(data = meteorites_usa, height=500, width=910) %>%
addFullscreenControl(position = "topright") %>%
addResetMapButton() %>%
addProviderTiles(providers$Stamen.TonerLite) %>%
addCircleMarkers(lng = ~reclong, lat = ~reclat,
color = ~pal (year),
radius = 3,
stroke = FALSE, fillOpacity = 1) %>%
addLegend("bottomright", pal = pal, values = ~year,
title = "<u>Year Found</u>",
labFormat = labelFormat(),
opacity = 0.9)
plot_ly(data = meteorites_usa,
x = ~year,
type = "histogram",
cumulative = list(enabled=TRUE),
marker = list(color = '#006AFF'),
name = "Cumulative",
height = 500, width = 910) %>%
add_trace(data = meteorites_usa,
x = ~year,
type = "histogram",
cumulative = list(enabled=F),
marker = list(color = '#00ffff'),
name = "Per Year") %>%
layout(title = "Meteorite Impacts in America (1960 - 2016)",
titlefont = list(
family = "Agency FB", size = 45, color = '#ffffff'),
font = list(family = "Agency FB", size = 25),
margin = 10,
paper_bgcolor='black',
plot_bgcolor='black',
xaxis = list(title = "Year",
color = '#ffffff'),
yaxis = list(title = "No. of Impacts",
color = '#ffffff'))
## Warning: The titlefont attribute is deprecated. Use title = list(font = ...)
## instead.
one of a large number of meteorite classifications based on physical, chemical, and other characteristics
Meteorites have a classification system based on the mineralogical, petrological, chemical, and isotopic properties of the meteorite. Here we have calculated the frequency for each class in our dataset and filtered for the top 10 most represented classes in our data. We visualize this distribution in a pie chart.
Our top 10 most frequent classes by number of impacts were: 1. H5 (202) 2. L6 (164) 3. H4 (125) 4. H6 (93) 5. L5 (92) 6. 0C (37) 7. L4 (36) 8. Iron, IIIAB (20) 9. CK4 (11) 10. LL6 (10)
# Create a frequency for our class variable
class_freq_table <- meteorites_usa %>%
dplyr::group_by(recclass) %>%
dplyr::tally() %>%
dplyr::ungroup() %>%
dplyr::arrange(desc(n)) %>%
dplyr::top_n(n = 10, wt = n)
colors = brewer.pal(10, "PuOr")
plot_ly(class_freq_table, labels = ~recclass, values = ~n, type = 'pie',
textposition = 'inside',
textinfo = 'label+percent',
insidetextfont = list(color = '#FFFFFF'),
height = 500, width = 910,
marker = list(colors = colors,line = list(color = 'white', width = 2)),
showlegend = FALSE,
hoverinfo = 'text',
text = ~paste('<b>Meteorite Classification:</b>', recclass, "<br>",
"<b>No. of Impacts:</b>", n, "</br>",
"<b>No. of Impacts:</b>", round(n/sum(n)*100, 1),"%", "</br>")) %>%
layout(title = 'Meteorites Classes in USA (1960 - 2016)',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)) %>%
layout(titlefont = list(
family = "Agency FB", size = 45, color = '#ffffff'),
font = list(family = "Agency FB", size = 25),
margin = 10,
paper_bgcolor='black',
plot_bgcolor='black')
## Warning: The titlefont attribute is deprecated. Use title = list(font = ...)
## instead.